Q1

Which 2 city names show up in the dataset more than twice?

# Use dplyr to find city names that appear more than twice
df <- data.frame(cities_tidy)
df %>% select(CityName) %>% group_by(CityName) %>% filter(n()>2)
## # A tibble: 6 x 1
## # Groups:   CityName [2]
##   CityName   
##   <chr>      
## 1 Bloomington
## 2 Bloomington
## 3 Bloomington
## 4 Springfield
## 5 Springfield
## 6 Springfield

Bloomington and Springfield show up in the dataset 3 times each.
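The same question can also be answered with one row per city name, which is a little easier to read than the repeated rows above. This is a minimal sketch on made-up data: `toy_cities` and `n_cities` are hypothetical names, and only the `CityName` column mirrors the real dataset.

```r
library(dplyr)

# Hypothetical stand-in for cities_tidy; the rows are invented for illustration
toy_cities <- tibble(
  CityName = c("Bloomington", "Bloomington", "Bloomington",
               "Springfield", "Springfield", "Springfield",
               "Athens", "Athens", "Roswell")
)

# count() collapses to one row per name; filter() keeps names seen more than twice
repeated <- toy_cities %>%
  count(CityName, name = "n_cities") %>%
  filter(n_cities > 2)

repeated
```

With this approach the number of occurrences is part of the output, so it does not have to be counted by eye.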


Q2

What is the rate of cancer in Springfield, IL? Look at the metadata I’ve provided. Write out in plain English what this number means.

# Use dplyr to keep only Springfield, IL, and view the cancer rate
cities_tidy %>% filter(CityName == "Springfield", StateAbbr == "IL") %>% select(Cancer)
## # A tibble: 1 x 1
##   Cancer
##    <dbl>
## 1    6.4

The rate of cancer in adults, excluding skin cancer, in Springfield, IL is 6.4%. This means that 6.4% of all adults over the age of 18 in Springfield, IL have reported having been told by a doctor that they have cancer, excluding skin cancers.


Q3

Which city in Georgia has the highest rate of binge drinking?

# Use dplyr to find which city in the state of GA has the highest binge drinking rate
GAdrunk <- cities_tidy %>% filter(StateAbbr == "GA") %>% select(CityName, BingeDrinking) %>% filter(BingeDrinking==max(BingeDrinking))
GAdrunk
## # A tibble: 1 x 2
##   CityName BingeDrinking
##   <chr>            <dbl>
## 1 Roswell           19.6

Roswell has the highest rate of binge drinking of the listed cities in Georgia.


Q4

What state in the US has the lowest average of binge drinking? Is this what you expected? What’s the sample size in this state? Discuss the limitations of looking only at state means.

options(scipen=999)
# Make df of vars we are interested in, group by state, and arrange in ascending order by Binge Drinking, then use head() to find the row of the lowest value
df2 <- data.frame(cities_tidy$StateAbbr, cities_tidy$BingeDrinking)
mydata <- df2 %>%
  group_by(cities_tidy.StateAbbr) %>%
  summarise(average = round(mean(cities_tidy.BingeDrinking), 2)) %>%
  arrange(average)

head(mydata, 1)
## # A tibble: 1 x 2
##   cities_tidy.StateAbbr average
##   <fct>                   <dbl>
## 1 UT                       11.2
# Sum population of all cities in UT
ss <- cities_tidy %>% filter(StateAbbr == "UT") %>% select(StateAbbr, PopulationCount) %>% summarise(sum = sum(PopulationCount))
ss
## # A tibble: 1 x 1
##      sum
##    <dbl>
## 1 930942
# Find number of cities included in this dataset from Utah
 ut <- cities_tidy %>% filter(StateAbbr=="UT") %>% select(CityName, PopulationCount, StateAbbr) 
 summarise(ut, n())
## # A tibble: 1 x 1
##   `n()`
##   <int>
## 1     9

Utah has the lowest average rate of binge drinking at just 11.21%. I am not too surprised that the average binge-drinking rate in Utah is low because it is a fairly religious state. More specifically, over half of the state’s population identifies as Mormon, a faith that forbids drinking (LA Times). The sample in this state covers a total population of 930,942 across 9 cities, which is much smaller than in some other states. One limitation of looking only at state means is that we cannot see which cities in particular have higher or lower rates, and outliers can skew the means. Medians, therefore, may be more representative. Additionally, there may be smaller towns and cities in Utah that are not included in this dataset and that struggle more heavily with binge drinking, so they are not represented by this state average.
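One way to see these limitations directly is to report the number of cities, the median, and the range next to each state mean. The sketch below uses invented numbers; `toy`, `state_summary`, and the summary column names are hypothetical, while `StateAbbr` and `BingeDrinking` follow the column names used above.

```r
library(dplyr)

# Hypothetical data, made up for illustration only
toy <- tibble(
  StateAbbr     = c("UT", "UT", "UT", "WI", "WI", "WI", "WI"),
  BingeDrinking = c(10.5, 11.0, 12.1, 22.0, 24.5, 25.0, 31.0)
)

# Report sample size and spread alongside the mean for each state
state_summary <- toy %>%
  group_by(StateAbbr) %>%
  summarise(
    n_cities  = n(),
    mean_bd   = mean(BingeDrinking),
    median_bd = median(BingeDrinking),
    min_bd    = min(BingeDrinking),
    max_bd    = max(BingeDrinking)
  ) %>%
  arrange(mean_bd)

state_summary
```

A state whose mean sits well above its median, or whose mean rests on only a handful of cities, is a sign that the mean alone should be read with caution.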


Q5

One of your colleagues is working on a project comparing rates of health insurance across the country. She asks you to compute the average rate of uninsured urban individuals in the southeastern US as compared to the rest of the country. What are the average rates of ‘NoHealthInsurance’ in the southeastern US (defined here as GA, NC, LA, AL, FL, SC), and the rest of the US respectively? Explain in plain English what this means (and doesn’t mean!). What additional information could you use to help your colleague make a more informed decision about the differences in access to health insurance across these two regions?

# Categorize data as southeast and not southeast 
southeast <- cities_tidy %>% filter(StateAbbr %in% c("GA", "NC","LA", "AL", "FL","SC")) %>% select(StateAbbr, NoHealthInsurance) 

the_rest <- cities_tidy %>% filter(!StateAbbr %in% c("GA", "NC", "LA", "AL", "FL", "SC")) %>% select(StateAbbr, NoHealthInsurance) 

# Find averages of uninsured individuals in the southeast and not southeast
(SE_uninsured <- mean(southeast$NoHealthInsurance))
## [1] 20.576
(the_Rest_uninsured <- mean(the_rest$NoHealthInsurance))
## [1] 15.55576

In the Southeast, the cities in the dataset average nearly 21 percent of adults aged 18 and over without health insurance, whereas the cities in the rest of the U.S. average closer to 16 percent. These are unweighted averages of city-level rates, so they describe the typical city in each region rather than the share of all adults in the region who are uninsured. It appears that far more urban individuals in the Southeast lack health insurance compared to the rest of the country; however, this is hard to judge with such a difference in sample sizes. With information on income, demographics, education, and age of residents, we could learn a lot more about accessibility to health care in the southeastern cities in comparison to the rest of the country. It would also be insightful to split the rest of the country into its respective regions in order to see how the Southeast compares to New England, the Midwest, and so forth. Additionally, because adults over the age of 65 qualify for Medicare, differences in the age structure of the two regions could account for part of the gap. Furthermore, information on health insurance status among people in rural areas could give a better idea of region-wide coverage, or lack thereof, as some regions are dominated by rural and agricultural landscapes.
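Because the two groups differ so much in size, it can help to compute the number of cities and the spread in the same pass as the means. This is a sketch under stated assumptions: the data are invented, and `toy`, `se_states`, and `region_summary` are hypothetical names; only `StateAbbr` and `NoHealthInsurance` come from the analysis above.

```r
library(dplyr)

se_states <- c("GA", "NC", "LA", "AL", "FL", "SC")

# Hypothetical city-level rates, made up for illustration only
toy <- tibble(
  StateAbbr         = c("GA", "FL", "NC", "TX", "CA", "NY", "OH", "WA"),
  NoHealthInsurance = c(22, 20, 21, 25, 14, 12, 13, 11)
)

# Label each city by region, then summarise both regions at once
region_summary <- toy %>%
  mutate(region = if_else(StateAbbr %in% se_states,
                          "Southeast", "Not Southeast")) %>%
  group_by(region) %>%
  summarise(
    n_cities       = n(),
    mean_uninsured = mean(NoHealthInsurance),
    sd_uninsured   = sd(NoHealthInsurance)
  )

region_summary
```

Seeing `n_cities` next to each mean makes the imbalance between the two groups explicit for the colleague.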


Q6

Create a visualization showing the difference in health insurance across these two regions to your colleague. How would you interpret this graphic to your colleague? Save this plot and upload it as the response to Q6 on Canvas.

# Create categorical variable of "Southeast" and "Not Southeast"
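# %in% tests membership in the set of states; == would recycle the vector element by element
cities_tidy$southeast <- cities_tidy$StateAbbr %in% c("GA", "AL", "FL", "NC", "SC", "LA")
cities_tidy$the_rest <- !(cities_tidy$StateAbbr %in% c("GA", "AL", "FL", "NC", "SC", "LA"))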
cities_tidy$region <- factor(cities_tidy$southeast, labels=c("Not Southeast", "Southeast"))

# Create box plots to compare both categories
p1 <- ggplot(cities_tidy %>% group_by(region), aes(x=reorder(region, -NoHealthInsurance, na.rm=TRUE), y=NoHealthInsurance, fill=region))+
  geom_boxplot(alpha=0.7)+
  xlab("Region")+
  ylab("Adults >18 Without Health Insurance (%)")+
  ggtitle("Distribution of Uninsured Adults in the \nSoutheast and Non-Southeastern Regions")+
  theme_minimal()+
  theme(legend.position = "none")+
  scale_fill_brewer(palette="Dark2")
  
ggplotly(p1)

From this graphic we can determine that, at the surface level, a higher share of adults in southeastern cities lack health insurance in comparison to the rest of the United States. We can also see a much larger range in the distribution of uninsured rates in the rest of the country, and much less variation in the Southeast. This is likely due to the much smaller sample size observed in the Southeast. Using plotly, we can identify each quartile, median, and outlying value. The median in the Southeast is 21% and in the rest of the country it is 15%. However, the maximum uninsured rate in the Southeast is 24.6%, whereas the maximum in the rest of the US is a whopping 43.9%. These plots allow us to see the distributions and which data points may be pulling the means in different directions.
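When building a region flag like the one used for this plot, the choice between `==` and `%in%` matters. With `==`, R recycles the shorter vector and compares element by element, so most southeastern cities are silently missed; `%in%` tests set membership. A small base-R sketch with a made-up `state` vector (not the real data) shows the difference:

```r
se_states <- c("GA", "AL", "FL", "NC", "SC", "LA")

# Hypothetical state column: six Georgia cities followed by six Texas cities
state <- c(rep("GA", 6), rep("TX", 6))

# == recycles se_states: GA vs GA, GA vs AL, GA vs FL, and so on
by_equals <- state == se_states

# %in% asks whether each element appears anywhere in se_states
by_in <- state %in% se_states

sum(by_equals)  # 1: only the first Georgia row happens to line up with "GA"
sum(by_in)      # 6: all six Georgia rows are flagged
```

It is worth tabulating the flag (for example with `table()`) before plotting, since an undercounted group changes every statistic the boxplot displays.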


Q7

Use ggplot to create a plot that shows the distribution of population size across cities. Add lines representing the median population (red) and the mean value (blue) of population. How might you explain the difference in the mean and median? As always, make the plot look nice. And be sure to explain in plain text what the plot means. Save this plot and upload it as the response to Q7 on Canvas.

options(scipen=999)

df <- data.frame(cities_tidy) %>% select(PopulationCount, CityName) 

mean_pop <- mean(cities_tidy$PopulationCount)
median_pop <- median(cities_tidy$PopulationCount)

p2 <- ggplot(df, aes(x=PopulationCount))+
  geom_histogram(alpha=0.7, fill="darkslategray4", binwidth = 250000)+
  geom_vline(aes(xintercept=mean_pop), colour="#0000FF")+
  geom_vline(aes(xintercept=median_pop), colour="#FF0000")+
  theme_minimal()+
  theme(legend.text=element_blank())+
  xlab("Population Size")+
  ylab("Number of Cities")+
  ggtitle("Distribution of Population Size Across the 500 Largest American Cities")
ggplotly(p2)


This histogram shows a strongly right-skewed, unimodal distribution of population counts across the 500 cities included in the dataset. Most of the cities have a population between about 40,000 and 1 million, with a few outliers ranging between 1 million and 8 million. The blue vertical line represents the mean population size of the cities, which is about 206,042. This is about 100,000 more than the median value of 106,106, represented by the red line. It makes sense that the mean is quite a bit higher than the median, as outliers such as New York, with roughly 8 million residents, pull the average upward. The median is a much better measure of the typical city population size in this case.
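The pull of a single very large city on the mean can be illustrated with a few invented numbers. The vector `pop` below is hypothetical and is not taken from the dataset.

```r
# Hypothetical populations: five mid-sized cities and one very large one
pop <- c(70000, 85000, 100000, 110000, 150000, 8000000)

mean_all   <- mean(pop)    # dragged far upward by the 8,000,000 value
median_all <- median(pop)  # 105000, the midpoint of the two middle values

# Dropping the largest city barely moves the median but collapses the mean
mean_trim   <- mean(pop[-6])    # 103000
median_trim <- median(pop[-6])  # 100000

c(mean_all = mean_all, median_all = median_all,
  mean_trim = mean_trim, median_trim = median_trim)
```

The median shifts by only 5,000 when the outlier is removed, while the mean changes by more than an order of magnitude, which is the same pattern the histogram shows.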


Q8

How many cities have Mental Health rates more than 2 standard deviations above the national mean?

nationalmean <- mean(cities_tidy$MentalHealth)
nationalsd <- sd(cities_tidy$MentalHealth)

# nationalmean - 2*nationalsd 

# finding how many cities have mental health averages below 9.044
goodmentalhealth <- cities_tidy %>% filter(MentalHealth < nationalmean-2*nationalsd)
summarise(goodmentalhealth, n())
## # A tibble: 1 x 1
##   `n()`
##   <int>
## 1    10

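10 cities in the US have Mental Health rates more than 2 standard deviations below the national mean (below about 9.044), which indicates comparatively good mental health. Note that the code above counts the lower tail; cities more than 2 standard deviations above the national mean would instead satisfy MentalHealth > nationalmean + 2*nationalsd.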
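Both tails can be counted in one step, which makes it harder to mix up "above" and "below". The following is a minimal sketch on invented values; `toy`, `tails`, and the threshold names are hypothetical, and only the `MentalHealth` column name comes from the dataset.

```r
library(dplyr)

# Hypothetical rates: eighteen typical cities, one very low, one very high
toy <- tibble(MentalHealth = c(rep(12, 18), 5, 20))

toy_mean <- mean(toy$MentalHealth)
toy_sd   <- sd(toy$MentalHealth)

upper <- toy_mean + 2 * toy_sd  # cutoff for "more than 2 SD above the mean"
lower <- toy_mean - 2 * toy_sd  # cutoff for "more than 2 SD below the mean"

# Count cities in each tail
tails <- toy %>%
  summarise(
    n_above = sum(MentalHealth > upper),
    n_below = sum(MentalHealth < lower)
  )

tails
```

Applied to `cities_tidy`, `n_below` should correspond to the count reported above, and `n_above` would be the count the question literally asks for.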


Q9

Pick a variable of interest to you. Create a visualization to show how the state of Georgia compares to a subset of other states. Save this plot and upload it as the response to Q9 on Canvas.

# variable of interest is high BP
df3 <- cities_tidy %>% select(StateAbbr, HighBloodPressure) %>% filter(StateAbbr %in% c("AZ", "NM", "OK", "TX", "GA"))

# States that make up the Southwest region, according to the US Fish & Wildlife Services, are AZ, NM, OK, TX

# Create boxplots to compare high BP across ga and southwestern states
p3 <- ggplot(df3)+
  geom_boxplot(aes(x=reorder(StateAbbr, desc(HighBloodPressure)), y=HighBloodPressure,fill=StateAbbr))+
  theme_minimal()+
  theme(legend.position = "none")+
  xlab("State Abbreviation")+
  ylab("Adults >18 who reported high BP (%)")+
  ggtitle("Comparing Rates of High Blood Pressure Between \n Georgia and Southwestern States")+
  scale_fill_brewer(palette="BrBG")
ggplotly(p3)
# How many cities in each state?

# cities_tidy %>% filter(StateAbbr=="GA") %>% select(CityName) GA=11
# cities_tidy %>% filter(StateAbbr=="TX") %>% select(CityName) TX = 47
# cities_tidy %>% filter(StateAbbr=="OK") %>% select(CityName) OK = 6
# cities_tidy %>% filter(StateAbbr=="AZ") %>% select(CityName) AZ= 12
# cities_tidy %>% filter(StateAbbr=="NM") %>% select(CityName)  NM= 4


Using the U.S. Fish & Wildlife Service’s definition of the Southwest, I chose Texas, Oklahoma, New Mexico, and Arizona to compare the rates of high blood pressure between Georgia and the Southwest. The results of this plot are quite interesting, even though the median values all fall within a range of 10%: Oklahoma has the second-fewest cities of the 5 states, and yet has the highest percentage of respondents with high blood pressure. Georgia has the widest range in its distribution, with a median not far off from OK’s, but with maximum and minimum values much higher and much lower than the other states’. I don’t find it surprising that Texas has a fairly symmetric distribution of respondents with high BP, as it has the most cities recorded at 47, and therefore a large sample size. I am, however, surprised that New Mexico has a very symmetric distribution around about 26%, as it only has a few cities and therefore a smaller sample size. I am interested to know what factors contribute to OK having a higher percentage of its population with high BP, and what contributes to NM having a lower percentage.
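The per-state city counts quoted above can be obtained in a single call rather than one filter per state. This sketch runs on a made-up table; `toy` and `state_counts` are hypothetical names, and the counts it produces are not the real ones.

```r
library(dplyr)

# Hypothetical state column, invented for illustration only
toy <- tibble(
  StateAbbr = c("GA", "GA", "TX", "TX", "TX", "OK", "AZ", "NM", "CA")
)

# Keep the five states of interest and count the cities in each
state_counts <- toy %>%
  filter(StateAbbr %in% c("AZ", "NM", "OK", "TX", "GA")) %>%
  count(StateAbbr, sort = TRUE)

state_counts
```

Having the counts in one table makes it easy to show them next to the boxplots, so readers can judge how much weight each state's distribution deserves.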